{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88\n",
      "  return f(*args, **kwds)\n",
      "/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88\n",
      "  return f(*args, **kwds)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Current dir: /home/fernando.delacalle/flcalle@ing.uc3m.es/Documentos/WORK/financial/adv_financial_ml_book/notebooks\n",
      "Current dir: /home/fernando.delacalle/flcalle@ing.uc3m.es/Documentos/WORK/financial/adv_financial_ml_book\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd \n",
    "\n",
    "import matplotlib as mpl\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.gridspec as gridspec\n",
    "%matplotlib inline\n",
    "import seaborn as sns\n",
    "plt.style.use('seaborn-talk')\n",
    "plt.style.use('bmh')\n",
    "plt.rcParams['font.weight'] = 'medium'\n",
    "#plt.rcParams['figure.figsize'] = 10,7\n",
    "blue, green, red, purple, gold, teal = sns.color_palette('colorblind', 6)\n",
    "\n",
    "import os\n",
    "print(f\"Current dir: {os.getcwd()}\")\n",
    "os.chdir('..')\n",
    "print(f\"Current dir: {os.getcwd()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 7: Cross-Validation in Finance\n",
    "\n",
    "\n",
    "___\n",
    "\n",
    "## Exercises\n",
    "**7.1** Why is shuffling a dataset before conducting k-fold CV generally a bad idea in finance? What is the purpose of shuffling? Why does shuffling defeat the purpose of k-fold CV in financial datasets?\n",
    "\n",
    "**7.2** Take a pair of matrices (X, y), representing observed features and labels. These could be one of the datasets derived from the exercises in Chapter 3.\n",
    "- **(a)** Derive the performance from a 10-fold CV of an RF classifier on (X, y), without shuffling.\n",
    "- **(b)** Derive the performance from a 10-fold CV of an RF on (X, y), with shuffling.\n",
    "- **(c)** Why are both results so different?\n",
    "- **(d)** How does shuffling leak information?\n",
    "\n",
    "**7.3** Take the same pair of matrices (X, y) you used in exercise 2.\n",
    "- **(a)** Derive the performance from a 10-fold purged CV of an RF on (X, y), with 1% embargo.\n",
    "- **(b)** Why is the performance lower?\n",
    "- **(c)** Why is this result more realistic?\n",
    "\n",
    "**7.4** In this chapter we have focused on one reason why k-fold CV fails in financial applications, namely the fact that some information from the testing set leaks into the training set. Can you think of a second reason for CV’s failure?\n",
    "\n",
    "**7.5** Suppose you try one thousand configurations of the same investment strategy, and perform a CV on each of them. Some results are guaranteed to look good, just by sheer luck. If you only publish those positive results, and hide the rest, your audience will not be able to deduce that these results are false positives, a statistical fluke. This phenomenon is called “selection bias.”\n",
    "- **(a)** Can you imagine one procedure to prevent this?\n",
    "- **(b)** What if we split the dataset in three sets: training, validation, and testing? The validation set is used to evaluate the trained parameters, and the testing is run only on the one configuration chosen in the validation phase. In what case does this procedure still fail?\n",
    "- **(c)** What is the key to avoiding selection bias?\n",
    "\n",
    "\n",
    "\n",
    "## Code Snippets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.snippets.ch7 import getTrainTimes\n",
    "from src.snippets.ch7 import getEmbargoTimes\n",
    "from src.snippets.ch7 import PurgedKFold\n",
    "from src.snippets.ch7 import cvScore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
